
Data Engineer/Big Data Developer Resume


Plano, Texas

PROFESSIONAL SUMMARY:

  • Over 7 years of IT development experience, including Data Warehousing and Data Analysis, with more than 4 years in the Big Data ecosystem using the Hadoop framework and related technologies such as HDFS, MapReduce, Hive, Pig, Spark, HBase, Flume, Oozie, Sqoop, Impala, Kafka, Splunk, and Zookeeper.
  • Excellent knowledge of distributed storage (HDFS) and distributed processing (MapReduce, YARN) for real-time streaming and batch processing.
  • Experience in using Python and Spark DataFrames to perform data transformations and build tables in Hive.
  • Interpreted problems and identified areas of improvement using data analysis, data mining, statistics and machine learning techniques.
  • Wrote custom MapReduce programs in Java and Python, and extended Hive functionality with custom UDFs.
  • Experience in extracting data from RDBMS into HDFS using Sqoop.
  • Worked on collecting real-time streaming data and log data from log collectors into HDFS using Flume and Kafka.
  • Experience with NoSQL databases such as HBase for key-based low latency queries.
  • Experience in analyzing data in HDFS through Impala, Hive and Spark.
  • Experience in workflow scheduling and monitoring tools like Control-M and Oozie.
  • Hands-on experience performing ETL operations in the Ab Initio environment and creating dashboard reports using Tableau.
  • Hands-on experience in Python, Java, Multi-threaded processing, SQL and PL/SQL.
  • Hands-on experience in loading and processing unstructured data (log files, XML data) into HDFS using Python and Flume.
  • Experience in performance tuning operations using Partitioning, Bucketing and Indexing in HIVE.
  • Hands-on experience with test frameworks for Hadoop using MRUnit framework.
  • Comfortable in Unix/Linux environments, working with operating systems such as CentOS, Red Hat, and Ubuntu.
  • In-depth knowledge of database connectivity for databases like Oracle 12c/11g/10g, MS SQL Server 2005/2008, MS Access, DB2 and Teradata.
  • Experience in writing ad-hoc MapReduce programs in Pig Latin.
  • Experience in performing Spark RDD transformations to process data using Data Frames and Data Pipelines.
  • Experience in working with lambda architecture using Scala on Spark.
  • Worked extensively on developing PySpark scripts to generate output files in formats such as Avro, JSON, and XML, as illustrated in the sketch after this list.
  • Experience in working with Hive tables in developing data pipelines, implementing complex business logic and optimizing Hive queries.
  • Worked on importing data into HBase using HBase shell and HBase client API.
  • Experience in automating job flows using Oozie.
  • Experience in working with Apache SOLR and Elasticsearch.
  • Worked on loading and unloading data to a set of files in Amazon S3 bucket.
  • Experience in developing shell scripts in UNIX and using SQL or PL/SQL to process data from the input file and load it into the database.
  • Involved in developing generic and custom Ab Initio graphs for Unix environments.
  • Experience in building a data lake to consolidate existing historical data.
  • Worked on pulling relevant raw data from the data lake for analysis according to the requirements.
  • Developed Unix shell wrapper scripts to run graphs in development, testing, and production environments.
  • Experience in planning and scheduling Control-M jobs and automating workloads to reduce incident volume.
  • Excellent understanding of statistics, machine learning, data mining, predictive analysis, data warehousing concepts, data modeling procedures, data profiling, data structures and algorithms.
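
Illustrative sketch for the PySpark output-file bullet above. This is a minimal example under assumed conditions, not project code: the source table, column names, and output paths are hypothetical, and the Avro write assumes the external spark-avro package is available on the cluster.

    # Minimal PySpark sketch: read a Hive table, apply DataFrame transformations,
    # and write the result as JSON and Avro output files.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("output-file-generation")      # hypothetical app name
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical source table and columns.
    events = spark.table("staging.events")

    daily_summary = (events
                     .filter(F.col("event_ts").isNotNull())
                     .withColumn("event_date", F.to_date("event_ts"))
                     .groupBy("event_date", "event_type")
                     .agg(F.count(F.lit(1)).alias("event_count")))

    # JSON output is built in; Avro requires the spark-avro package on the classpath.
    daily_summary.write.mode("overwrite").json("/data/out/daily_summary_json")
    daily_summary.write.mode("overwrite").format("avro").save("/data/out/daily_summary_avro")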

TECHNICAL SKILLS:

Big Data Ecosystem: HDFS, MapReduce, Spark, YARN, Pig, Hive, Impala, HBase, Sqoop, Flume, Cloudera Hue, SOLR, Storm, Kafka, Elasticsearch, Oozie, Zookeeper

Languages: Java, Scala, Python, Pig Latin, HiveQL, SQL, PL/SQL, XML, JSON

Database Systems: Oracle 12c/11g/10g, MS SQL Server 2005/2008, MS Access, DB2, Teradata, Greenplum

NoSQL Databases: HBase, Cassandra, MongoDB

IDEs: Eclipse, IntelliJ, Netbeans

Scripting Tools: UNIX Shell Scripting, PERL

Operating Systems: Linux, Unix, Windows 7/Vista/XP/10

ETL Tools: Ab Initio (GDE 3.2.2, Co>Operating System 3.0.4.10), Tableau

Scheduling Tools: Control-M, Tidal Enterprise Scheduler, Crontab

PROFESSIONAL EXPERIENCE:

Confidential, Plano, Texas

Data Engineer/Big Data Developer

Responsibilities:

  • Built a generic data ingestion framework to extract data from multiple sources like Oracle, delimited flat files, XML, Parquet, and JSON, using it to build Hive/Impala tables.
  • Built big data analytic solutions to provide near real-time and batch data reports to business users according to requirements.
  • Responsible for design, development and maintenance of workflows to integrate Shell-actions, Java-actions, Sqoop-actions, Hive-actions and Spark-actions into Oozie workflow nodes to run data pipelines.
  • Used Python to parse XML files and create flat files from them, as illustrated in a sketch after this list.
  • Worked with Spark DataFrames and RDDs in Python to transform and load data into Hive tables based on requirements.
  • Worked extensively with PySpark/Spark SQL for data cleansing and for generating DataFrames and RDDs.
  • Analyzed SQL scripts for the design and implementation of solutions using PySpark.
  • Worked on creating Spark applications in Scala using map, flatMap, combineByKey, reduceByKey, filter, groupByKey, distinct, cogroup, join, count, collect, and reduce functions to process data.
  • Used Impala for low latency queries, visualization and faster-querying purposes.
  • Imported, exported, and appended incremental data into HDFS using PySpark and Sqoop from an Oracle database and ingested it into Hive tables.
  • Used PySpark to build tables that require multiple computations and non-equi joins.
  • Exported analyzed data to relational databases using PySpark and Sqoop for visualization and to generate reports for the BI team.
  • Used HBase to support front-end applications that retrieve data using row keys.
  • Built a data quality framework using Java and Impala to run data rules, generate reports, and send daily email notifications of business-critical job successes and failures to business users.
  • Determined the size of incoming data and the level of computation required to process it, and applied suitable methods to transform the data and compute aggregations.
  • Handled the design and support of multi-tenancy on our data platform to allow other teams to run their applications.
  • Worked on configuration and automation of workflows using Control-M and led the production support teams through their operational, scheduling and monitoring activities.
  • Created partitioned tables in Hive for better performance and faster querying.
  • Worked extensively on creating Hive tables to store data that resulted after querying from large data sets.
  • Worked on debugging and performance tuning of Hive and Pig jobs.
  • Knowledge of SQL, NumPy, pandas, scikit-learn, and PySpark for data analysis and model building.
  • Worked with file formats and compression codecs such as Snappy, Gzip, Bzip2, Avro, and plain text.
  • Processed JSON files using PySpark and created Hive tables, as illustrated in a sketch after this list.
  • Developed Python regular expression (re module) operations in the Hadoop/Hive environment.
  • Developed on-the-fly decryption support for Hive, Pig and custom Map Reduce use cases using Java API.
  • Involved in using HCatalog to access Hive metadata from MapReduce or Pig code.
  • Worked on data pre-processing and cleaning to perform feature engineering and data imputation for missing values in various datasets using Python.
  • Automated jobs for pulling or sending files from and to SFTP servers according to business requirements.
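
Illustrative sketch for the JSON-to-Hive bullets above. A minimal PySpark example under assumed conditions: the landing path, database, table, and column names are hypothetical placeholders, and the target database is assumed to exist.

    # Minimal PySpark sketch: load raw JSON files and persist them as a
    # partitioned Hive table for faster, partition-pruned queries.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("json-to-hive")                 # hypothetical app name
             .enableHiveSupport()
             .getOrCreate())

    raw = spark.read.json("/landing/orders/*.json")   # hypothetical landing path

    orders = (raw
              .withColumn("order_date", F.to_date("order_ts"))  # hypothetical column
              .dropDuplicates(["order_id"]))

    # Write as a Hive table partitioned by order_date (analytics db assumed to exist).
    (orders.write
           .mode("overwrite")
           .partitionBy("order_date")
           .saveAsTable("analytics.orders"))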
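
Illustrative sketch for the XML-to-flat-file bullet above, using only the Python standard library. The file names, element tags, and delimiter are hypothetical.

    # Minimal Python sketch: parse an XML file and write selected fields
    # to a pipe-delimited flat file.
    import csv
    import xml.etree.ElementTree as ET

    tree = ET.parse("customers.xml")           # hypothetical input file
    root = tree.getroot()

    with open("customers.dat", "w", newline="") as out:
        writer = csv.writer(out, delimiter="|")
        writer.writerow(["customer_id", "name", "city"])
        for cust in root.findall("customer"):  # hypothetical element layout
            writer.writerow([
                cust.findtext("id", default=""),
                cust.findtext("name", default=""),
                cust.findtext("city", default=""),
            ])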

Confidential, Scottsdale, Arizona

Data Engineer

Responsibilities:

  • Developed multiple MapReduce programs for extraction, transformation, and aggregation across multiple file formats, including XML, JSON, CSV, and other compressed formats.
  • Wrote MapReduce jobs to parse and then process data per requirements.
  • Designed and implemented MapReduce jobs to support distributed processing and provide end-to-end solutions in the Cloudera environment.
  • Implemented automation of workloads using Shell Scripting.
  • Involved in debugging Java MapReduce programs used to process data.
  • Implemented processing logic for both ORC and text file formats.
  • Built the data lake environment from scratch by pooling data from different sources.
  • Worked extensively on importing and exporting data into HDFS and Hive/Impala tables from Relational Database Systems using Sqoop.
  • Used Flume to collect and store data from different sources into HDFS and later processed it using HiveQL, Pig Latin and Java.
  • Built scalable multi-threaded applications for large data processing using Pig.
  • Developed Pig Latin scripts using DDL and DML to extract data from files and load it into HDFS.
  • Worked on HBase whenever required for low latency queries.
  • Used Oozie workflow engine to run multiple Hive/Impala and Pig jobs.
  • Assigned schemas and created Hive tables with partitioning and bucketing for faster analytics and better performance.
  • Implemented Spark RDD transformations and actions to migrate MapReduce algorithms, as sketched after this list.
  • Developed User Defined Functions (UDFs) to provide custom Hive and Pig capabilities.
  • Used lambda architecture to build large-scale, distributed data processing systems.
  • Created and compared solutions with NoSQL databases and SQL server solutions.
  • Involved in backup activities, preparation of transition documents, and Ab Initio ETL design and testing.
  • Involved in ETL process migration from Ab Initio to Hadoop environments.
  • Loaded and unloaded data to and from HDFS with Ab Initio, using Hadoop read/write components and direct Hive scripts.
  • Performed data loads and unloads from DB2 and Oracle databases to Hive using Sqoop scripts.
  • Guided testing teams in test data preparation to ensure test coverage, including production data scrubbing and test data manipulation using Hive and Pig scripts.
  • Responsible for building and maintaining a regression test tool using Ab Initio, Hive, and Python scripts.
  • Worked with support teams to resolve operational and performance issues.
  • Designed distributed solutions for parallel processing of large data.
  • Worked with cross functional teams to get efficient data results and reports as per the business requirements.
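
Illustrative sketch for the MapReduce-to-Spark migration bullet above: a minimal PySpark RDD version of a classic MapReduce word-count aggregation. The input path is a hypothetical placeholder.

    # Minimal PySpark RDD sketch of a MapReduce-style aggregation (word count),
    # the kind of job that can be migrated from Java MapReduce to Spark.
    from pyspark import SparkContext

    sc = SparkContext(appName="mapreduce-migration-sketch")

    counts = (sc.textFile("/data/raw/logs/*.txt")        # hypothetical input
                .flatMap(lambda line: line.split())      # map phase
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))        # reduce phase

    # Action triggers execution; sample a few results instead of collecting everything.
    for word, count in counts.take(10):
        print(word, count)

    sc.stop()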

Confidential, Columbus, Ohio

ETL Developer

Responsibilities:

  • Developed several Ab Initio graphs based on business requirements using components such as Reformat, Rollup, Join, Scan, Normalize, and Gather.
  • Maintained Unix shell scripts for moving Ab Initio code from one Unix environment to another.
  • Designed, developed, tested, and implemented Ab Initio graphs and Korn shell scripts.
  • Developed Ab Initio graphs for data validation using validation components such as Compare Records and Compute Checksum.
  • Implemented data parallelism in graphs using Ab Initio partition components, dividing data into segments and operating on each segment simultaneously.
  • Responsible for documentation of complete Graphs and their components.
  • Involved in Ab Initio design and configuration for ETL, data mapping, transformation, and loading in a complex, high-volume environment, processing data at the terabyte level.
  • Involved in automating the ETL process through scheduling.
  • Responsible for cleansing data from source systems using Ab Initio components such as Reformat and Filter by Expression.
  • Developed psets to impose reusable business rules and improve graph performance.
  • Developed several partition-based Ab Initio Graphs for high volume data warehouse.
  • Experience in batch scheduling using Control-M Workload Automation.
  • Experience in Control-M version migration and maintaining server integrity to reduce downtime in the production environment.
  • Performed data analysis to reduce incident frequency, using incident and change management tools such as Hive, Automic Workload Automation, Control-M, Resolve, and HP Service Manager.
  • Coordinated with various L3 support and SME teams to track and document incident data.

Confidential

Business Analyst

Responsibilities:

  • Performed SWOT analysis of transaction processing in the current-state system and workflows, and estimated savings from proposed workflow changes and system enhancements.
  • Identified return on investment (ROI) opportunities that would reduce costs and increase transaction-processing efficiency through automation.
  • Modeled the system using Use Cases and State diagrams, and performed gap, risk, and cost-benefit analysis.
  • Delivered Business Requirement Documentation (BRD) from Use Case specifications, identified goals and objectives, and supported Functional Specification Documentation (FSD) and Data Mapping Documentation.
  • Researched and modeled activity workflows to report business trends for business unit development.
  • Wrote SQL statements to extract data from the application back end and verify it against front-end results, as illustrated in the sketch after this list.
  • Involved in developing and handling PL/SQL packages, procedures and database triggers.
  • Identified data sources, coordinated data feeds from external sources and the development of reports, queries, and forms to meet the information needs of business partners.
  • Involved in writing business logic for, and developing and modifying, a business portal that enables end users to apply for financial licenses online.
  • Generated customer data reports using PL/SQL procedures and handled generation of user profiles.
  • Involved in the development of a Tallyman collection product that allows businesses to manage the recovery of debt from customers.
  • Provided technical support for clients, business partners and end-users.
  • Involved in redefining subscriber profiles using PL/SQL procedures and triggers based on metrics such as the number of cheque bounces and barring activities.
  • Prepared technical specification, interface, and detailed design documents to generate customer status reports using PL/SQL stored procedures in Oracle 11g.
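
Illustrative sketch for the back-end verification bullet above. It assumes the cx_Oracle driver; the credentials, DSN, table and column names, and the front-end figure are hypothetical placeholders, not actual project values.

    # Minimal sketch: run a SQL count against the application database and
    # compare it with a figure taken from a front-end report.
    import datetime
    import cx_Oracle

    conn = cx_Oracle.connect(user="report_user", password="***",
                             dsn="dbhost/ORCLPDB")        # hypothetical DSN
    cur = conn.cursor()

    # Hypothetical table/column names; count yesterday's back-end transactions.
    cur.execute(
        "SELECT COUNT(*) FROM transactions WHERE TRUNC(txn_date) = :run_date",
        run_date=datetime.date.today() - datetime.timedelta(days=1),
    )
    backend_count = cur.fetchone()[0]

    front_end_count = 12480   # example value read from the front-end report
    print("match" if backend_count == front_end_count else
          f"mismatch: backend={backend_count}, frontend={front_end_count}")

    cur.close()
    conn.close()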
