
Data Engineer Resume


Mount Laurel, NJ

SUMMARY:

  • 8+ years of professional IT experience, including around 4 years of hands-on experience in Hadoop using Cloudera and Hortonworks; the Hadoop working environment includes MapReduce, HDFS, HBase, Zookeeper, Oozie, YARN, Hive, Sqoop, Pig, Cassandra, and Flume.
  • Good experience applying data processing technology stacks such as Hadoop and its ecosystem, SQL, Snowflake, and Spark.
  • Experience with the RDD architecture, implementing Spark operations on RDDs, and optimizing transformations and actions in Spark.
  • Good knowledge of RDBMS data warehouses such as Oracle, Snowflake, and Teradata.
  • Strong programming skills in Python, including libraries such as NumPy, Pandas, Matplotlib, and scikit-learn.
  • Strong experience in ETL and data warehouse testing.
  • Hands-on experience with Google Cloud Platform (GCP) services such as Kubernetes, Compute Engine, App Engine, Cloud Functions, Cloud Run, Storage, BigQuery, Stackdriver Monitoring, and Load Balancing.
  • Good knowledge of data visualization tools like Splunk and Kibana.
  • Experience in installation and setup of various Kafka producers and consumers along with the Kafka brokers and topics.
  • Excellent experience in various AWS services like EC2, Lambda, EMR, CloudWatch, SNS etc.
  • In-depth understanding of Snowflake cloud technology.
  • Understanding of Snowflake multi-cluster sizing and credit usage.
  • Experienced in managing Hadoop clusters using the Cloudera Manager tool.
  • Experience in Big Data/Hadoop testing.
  • Machine learning frameworks and statistical analysis with Python, using libraries such as scikit-learn.
  • Remote login to Virtual Machines to troubleshoot, monitor and deploy applications.
  • Managing Windows 2012 servers, troubleshooting IP issues and working with different support teams.
  • Very good understanding/knowledge of Hadoop architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
  • Strong hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
  • Experience in importing and exporting data using Sqoop from HDFS to relational database systems (RDBMS)/mainframe and vice versa.
  • Experience in distributed file systems (Hadoop, Cassandra, etc.) and proficiency with Spark and Spark Streaming.
  • Strong experience in writing UNIX shell scripts and Python scripts.
  • Involved in developing web services using REST and the HBase native API client to query data from HBase.
  • Experienced in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing, and internal/external tables.
  • Experience working with data formats like Avro and Parquet.
  • Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Datasets (see the brief PySpark sketch after this list).
  • Expertise in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
  • Experience in working with different data sources like Flat files, XML files, log files and Database.
  • Real time exposure to Amazon Web Services, AWS command line interface, and AWS data pipeline.
  • Work experience with cloud infrastructure like Amazon Web Services (AWS).
  • Extensive experience working with various Hadoop distributions, such as enterprise versions of Cloudera (CDH4/CDH5) and Hortonworks, with good knowledge of the MapR distribution, IBM BigInsights, and Amazon EMR (Elastic MapReduce).
  • Experience designing and developing POCs in Spark using Python to compare the performance of Spark with Hive and SQL/Oracle.
  • Experience with Agile and Scrum methodologies.
  • Expertise in application development using Python, RDBMS, and UNIX shell scripting.
  • Extensive experience with SQL, PL/SQL and database concepts.
  • Worked on ingesting log data into Hadoop using Flume.
  • Experience in optimizing queries by creating various clustered and non-clustered indexes and indexed views, and in applying data modeling concepts.
  • Experience with Source Code Management tools and proficient in GIT, Bitbucket etc.
  • Excellent interpersonal and communication skills, creative, research-minded with problem solving skills.
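
As a brief illustration of the Spark work summarized above, the following is a minimal PySpark sketch (file paths and column names are hypothetical, not taken from any project described here) showing pair-RDD aggregation alongside the equivalent DataFrame/Spark SQL approach:

    # Minimal PySpark sketch: pair-RDD aggregation and the DataFrame/Spark SQL equivalent.
    # Paths and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

    # Pair-RDD route: count events per user from delimited log lines.
    events = spark.sparkContext.textFile("hdfs:///data/events/*.log")
    counts = (events
              .map(lambda line: (line.split(",")[0], 1))   # emit (user_id, 1)
              .reduceByKey(lambda a, b: a + b))            # aggregate per key
    counts.saveAsTextFile("hdfs:///data/events/counts_rdd")

    # DataFrame / Spark SQL route over the same data.
    df = spark.read.option("header", True).csv("hdfs:///data/events")
    df.createOrReplaceTempView("events")
    spark.sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id").show()

    spark.stop()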

TECHNICAL SKILLS:

Big Data Technologies: HDFS, Hive, MapReduce, Pig, Sqoop, Flume, Oozie, Hadoop distributions, HBase, Spark, YARN, Zookeeper, Kafka, Spark SQL, Impala

Machine Learning: Pandas, NumPy, Matplotlib, Seaborn, Sklearn.

Programming languages: Core Java, Spring Boot, Spark, JIT.

Databases: MySQL, SQL/PL-SQL, MS SQL Server 2012/16, Snowflake, Oracle 10g/11g/12c

Scripting/Web Languages: SQL, Shell, Perl, Python.

NoSQL Databases: Cassandra, HBase, MongoDB.

Operating Systems: Linux, Windows XP/7/8/10, Mac.

Software Life Cycle: SDLC, Waterfall and Agile models.

Utilities/Tools: Eclipse, Tomcat, NetBeans, JUnit, SQL, SVN, Log4j, SOAP UI, ANT, Maven, Alteryx, Visio.

Data Visualization Tools: Tableau, Splunk, Kibana

Cloud Environment: Google Cloud Platform (GCP), AWS, EC2, S3, EMR, RDS, Lambda, CloudWatch, SNS, Auto scaling, Terraform.

WORK EXPERIENCE:

Data Engineer

Confidential, Mount Laurel, NJ

Responsibilities:

  • Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
  • Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
  • Monitored the cluster for performance, networking, and data integrity issues, and was responsible for troubleshooting issues in the execution of Spark jobs by inspecting and reviewing log files.
  • Extensively implemented Terraform to create/manage GCP projects, Kubernetes clusters, and other GCP resources.
  • Build data systems and pipelines to evaluate business needs and objectives.
  • Interpret trends and patterns; conduct complex data analysis and report on results.
  • Use the data analytics tools Splunk and Kibana to build dashboards.
  • Prepare data for prescriptive and predictive modeling using machine learning.
  • Build algorithms and prototypes using machine learning tools such as scikit-learn.
  • Combine raw information from different sources using Python scripts.
  • Explore ways to enhance data quality and reliability.
  • Identify opportunities for data acquisition and develop analytical tools and programs.
  • Played a key role in migrating Teradata objects into the Snowflake environment.
  • Developed PySpark code to read data from Hive, group the fields, and generate XML files, and enhanced the PySpark code to write the generated XML files to a directory and zip them into CDAs (see the illustrative sketch after this list).
  • Developed Spark jobs using Python on top of YARN/MRv2 for interactive and batch analysis; applied expertise with data models, data mining, and segmentation techniques.
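
A minimal, illustrative PySpark sketch of the Hive-to-XML flow mentioned above; the database, table, and column names, the output path, and the CDA packaging details are hypothetical placeholders rather than the production code:

    # Illustrative sketch: read a Hive table, group rows by a key, emit one XML file per key.
    # Database, table, column names, and paths are hypothetical.
    import os
    import xml.etree.ElementTree as ET
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-to-xml")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.table("clinical_db.encounters")      # hypothetical Hive table

    out_dir = "/tmp/xml_out"
    os.makedirs(out_dir, exist_ok=True)

    # For each distinct key, collect its rows and write a small XML document.
    for key_row in df.select("record_id").distinct().collect():
        key = key_row["record_id"]
        rows = df.filter(df.record_id == key).collect()
        root = ET.Element("Document", id=str(key))
        for r in rows:
            rec = ET.SubElement(root, "Record")
            for field, value in r.asDict().items():
                ET.SubElement(rec, field).text = str(value)
        ET.ElementTree(root).write(os.path.join(out_dir, "%s.xml" % key))

    spark.stop()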

Confidential, Overland Park, Kansas

Hadoop Developer

Responsibilities:

  • Worked directly with the Big Data Architecture team, which created the foundation of this enterprise analytics initiative in a Hadoop-based data lake.
  • Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
  • Worked with the JSON file format for StreamSets. Worked with the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs.
  • Strong experience in ETL and data warehouse testing.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS (a brief sketch follows this list).
  • Developed data pipeline using Flume, Sqoop, Spark with Scala to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Experience in Big Data/Hadoop testing.
  • Upgraded the Hadoop cluster from CDH4.7 to CDH5.2 and worked on installing the cluster, commissioning and decommissioning Data Nodes, NameNode recovery, capacity planning, and slots configuration.
  • Developed Spark scripts to import large files from Amazon S3 buckets and imported data from different sources like HDFS/HBase into Spark RDDs.
  • Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis.
  • Monitored the cluster for performance, networking, and data integrity issues, and was responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
  • Created 25+ Linux Bash scripts for users, groups, data distribution, capacity planning, and system monitoring.
  • Installed the OS and administered the Hadoop stack with the CDH5 (with YARN) Cloudera distribution, including configuration management, monitoring, debugging, and performance tuning.
  • Worked with AWS team to manage servers in AWS.
  • Migrated an existing on-premises application to AWS, used AWS services like EC2 and S3 for large data set processing and storage, worked with Elastic MapReduce (EMR), and set up a Hadoop environment on AWS EC2 instances.
  • Created Hive external tables, loaded data into the tables, and queried the data using HQL; worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Developed PySpark code to read data from Hive, group the fields, and generate XML files, and enhanced the PySpark code to write the generated XML files to a directory and zip them into CDAs.
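
A minimal sketch of the Kafka-to-Parquet flow described above, written here with PySpark Structured Streaming rather than the original DStream-based code; the broker addresses, topic name, and HDFS paths are hypothetical, and the spark-sql-kafka connector package is assumed to be available:

    # Illustrative PySpark Structured Streaming sketch: Kafka feed -> Parquet files on HDFS.
    # Broker list, topic, and paths are hypothetical; requires the spark-sql-kafka connector.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
           .option("subscribe", "customer-events")
           .load())

    # Kafka delivers key/value as binary; cast the value to a string for downstream parsing.
    events = raw.select(col("value").cast("string").alias("event_json"), col("timestamp"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events/parquet")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())

    query.awaitTermination()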

Environment: Hadoop, MapReduce, HDFS, Hive, Java, Oozie, Linux, Eclipse, PuTTY, WinSCP, Oracle 10g, PL/SQL, YARN, Spark, Scala, Python, Sqoop, DB2, AWS.

Confidential, Detroit, Michigan

Hadoop Developer

Responsibilities:

  • Worked extensively on Hadoop components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, YARN, and MapReduce programming.
  • Working as Cloud Administrator on Microsoft Azure, involved in configuring virtual machines, storage accounts, resource groups.
  • Remote login to Virtual Machines to troubleshoot, monitor and deploy applications.
  • Managing Windows 2012 servers, troubleshooting IP issues and working with different support teams.
  • Experience using Snowflake cloning and Time Travel.
  • Played a key role in migrating Teradata objects into the Snowflake environment.
  • Developed MapReduce programs to clean and aggregate the data (see the Hadoop Streaming sketch after this list).
  • Responsible for building scalable distributed data solutions using Hadoop and Spark.
  • Implemented Hive Ad-hoc queries to handle Member data from different data sources such as Epic and Centricity.
  • Implemented Hive UDF's and did performance tuning for better results.
  • Analyzed the data by performing Hive queries and running Pig Scripts.
  • Involved in loading data from UNIX file system to HDFS.
  • Implemented optimized map joins to get data from different sources to perform cleaning operations before applying the algorithms.
  • Experience using Sqoop to import and export data between Netezza and Oracle DB and HDFS/Hive.
  • Implemented POC to introduce Spark Transformations.
  • Worked with the NoSQL databases HBase, MongoDB, and Cassandra to create tables and store data.
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, streamed data using Flume, and loaded it into HDFS.
  • Worked on transforming data from MapReduce into HBase as bulk operations.
  • Implemented CRUD operations on HBase data using the Thrift API to get real-time insights.
  • Installed Oozie workflow engine to run multiple MapReduce, Hive, Impala, Zookeeper and Pig jobs which run independently with time and data availability.
  • Developed workflows in Oozie to manage and schedule jobs on the Hadoop cluster for generating reports on a nightly, weekly, and monthly basis.
  • Used Zookeeper to manage Hadoop clusters and Oozie to schedule job workflows.
  • Implemented test scripts to support test driven development and continuous integration.
  • Involved in data ingestion into HDFS using Apache Sqoop from a variety of sources, using connectors like JDBC and import parameters.
  • Coordinated with Hadoop admins during deployment to production.
  • Developed Pig Latin Scripts to extract data from log files and store them to HDFS. Created User Defined Functions (UDFs) to pre- process data for analysis.
  • Developed scripts and batch jobs to schedule various Hadoop programs.
  • Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
  • Participated in design and implementation discussions for developing the Cloudera 5 Hadoop ecosystem.
  • Used JIRA and Confluence to update tasks and maintain documentation.
  • Worked in Agile development environment in sprint cycles of two weeks by dividing and organizing tasks. Participated in daily scrum and other design related meetings.
  • Created final reports of analyzed data using Apache Hue and Hive Browser and generated graphs for studying by the data analytics team.
  • Used Sqoop to export the analyzed data to a relational database for analysis by the data analytics team.
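
A minimal Hadoop Streaming sketch, in Python, of the kind of clean-and-aggregate MapReduce job referenced above; the field layout, delimiter, and paths are hypothetical, and the run command shown in the comment is indicative only:

    # Illustrative Hadoop Streaming job: drop malformed rows, then count records per key.
    # Run (indicative): hadoop jar hadoop-streaming.jar -files mr_job.py \
    #   -mapper "python mr_job.py map" -reducer "python mr_job.py reduce" \
    #   -input /data/raw -output /data/clean_counts
    import sys

    def mapper():
        for line in sys.stdin:
            parts = line.rstrip("\n").split(",")
            if len(parts) < 2 or not parts[0].strip():   # cleaning step: skip bad rows
                continue
            print("%s\t1" % parts[0].strip())            # emit (key, 1)

    def reducer():
        current, total = None, 0
        for line in sys.stdin:                           # input arrives sorted by key
            key, value = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = key, 0
            total += int(value)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()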

Environment: Hadoop, Cloudera Hadoop, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Java, JSON, Spark, HDFS, YARN, Oozie Scheduler, Zookeeper, Mahout, Linux, UNIX, ETL, MySQL.

Confidential, Sacramento, California

Hadoop Admin

Responsibilities:

  • Installed the Cloudera distribution of the Hadoop cluster and the services HDFS, Pig, Hive, Sqoop, Flume, and MapReduce.
  • Responsible for providing an open-source platform based on Apache Hadoop for analyzing, storing, and managing big data.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data.
  • Responsible for managing data coming from different sources.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Wrote Hive queries.
  • Involved in loading data from UNIX file system to HDFS.
  • Created Hive tables, loaded them with data, and wrote queries that run internally in MapReduce, performing data analysis as per the business requirements (see the illustrative sketch after this list).
  • Worked with analysts to determine and understand business requirements.
  • Loaded and transformed large datasets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
  • Developed data pipelines using Flume, Sqoop, Pig, and MapReduce to ingest customer data and financial histories into HDFS for analysis.
  • Used MapReduce and Flume to load, aggregate, store and analyze web log data from different web servers.
  • Created MapReduce programs to handle semi/unstructured data like XML, JSON, AVRO data files and sequence files for log files.
  • Involved in submitting and tracking MapReduce jobs using Job Tracker.
  • Experience writing Pig Latin scripts for data cleansing, ETL operations, and query optimization of existing scripts.
  • Wrote Hive UDFs to sort struct fields and return complex data types.
  • Created Hive tables from JSON data using serialization frameworks like Avro.
  • Experience writing reusable custom Hive and Pig UDFs in Java and using existing UDFs from Piggybank and other sources.
  • Experience in working with NoSQL database HBase in getting real time data analytics.
  • Integrated Hive tables to HBase to perform row level analytics.
  • Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files.
  • Developed Unit Test Cases for Mapper, Reducer and Driver classes using MR Testing library.
  • Supported the operations team in Hadoop cluster maintenance, including commissioning and decommissioning nodes and upgrades.
  • Provided technical assistance to all development projects.
  • Hands-on experience with Qlik Sense for Data Visualization and Analysis on large data sets, drawing various insights.
  • Created dashboards using Qlik Sense and performed Data extracts, Data blending, Forecasting, and table calculations.
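
For illustration only, a short Python sketch of running HiveQL such as the table creation and analysis queries described above; the use of the PyHive client, along with the host, database, table, and column names, is an assumption made for the sketch and not something named in this resume:

    # Illustrative sketch: run HiveQL from Python via PyHive (assumed client library).
    # Host, database, table, and column names are hypothetical.
    from pyhive import hive

    conn = hive.Connection(host="hive-server.example.com", port=10000, database="default")
    cur = conn.cursor()

    # Create an external table over Avro-serialized data derived from JSON logs.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_avro (
            host STRING,
            request STRING,
            status INT
        )
        STORED AS AVRO
        LOCATION '/data/web_logs_avro'
    """)

    # Simple analysis query; Hive compiles this into MapReduce (or Tez/Spark) jobs.
    cur.execute("SELECT status, COUNT(*) FROM web_logs_avro GROUP BY status")
    for status, cnt in cur.fetchall():
        print(status, cnt)

    conn.close()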

Environment: Hadoop, MapReduce, Yarn, Hive, HDFS, PIG, Sqoop, Solr, Oozie, Impala, Spark, Hortonworks, HBase, ZooKeeper and Unix/Linux, Hue (Beeswax), AWS.

Confidential

Java Developer

Responsibilities:

  • Worked on Spring integration with JSF.
  • Worked on Spring application framework features IOC container and AOP.
  • Worked on JSPs using JSF UI tags, injecting backing beans through spring.
  • Configured faces-config.xml and Spring IOC.
  • Developed screens consisting of JSF validators, jQuery, AJAX, JavaScript, CSS, and HTML.
  • Developed applications using the Hibernate persistence framework: developed persistent classes and Hibernate mapping (.hbm.xml) files, and used Hibernate Query Language (HQL).
  • Worked on the web service classes and WSDL generation.
  • Used Spring Framework for Dependency injection and integrated with ORM framework Hibernate.
  • Developed and Deployed Web services - WSDL and SOAP for getting credit score information from third party.
  • Used XStream for Java-XML binding.
  • Used CVS, Perforce as configuration management tool for code versioning and release.
  • Deployment on WebSphere Application Server 6.0.
  • Designed and developed reporting module using Jasper Reports.
  • Used Log4J to print the logging, debugging, warning, info on the server console.
  • Involved in debugging and troubleshooting related to production and environment issues.
  • Created test cases, including JUnit test cases, and tested the application thoroughly.
  • Performed E2E Testing.

Environment: Java, JEE Servlet, JSP, Data Structures, Eclipse, JBoss, XML, JAXB, JSF 2.0, Spring, Hibernate, jQuery, Log4j, ANT, web services, Jasper, JUnit, WebSphere Application Server 6.0, Oracle, and Linux
