
Senior Big Data Engineer Resume


Atlanta, GA

SUMMARY

  • Over 8 years of IT development experience, including experience in the Big Data ecosystem and related technologies.
  • Implemented various algorithms for analytics using Cassandra with Spark and Scala.
  • Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL); used UDFs from the Piggybank UDF repository.
  • Experienced in running queries using Impala and in using BI tools to run ad-hoc queries directly on Hadoop.
  • Good experience with the Oozie framework and automating daily import jobs.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS and other services of the AWS family.
  • Selecting appropriate AWS services to design and deploy an application based on given requirements.
  • Experienced in managing Hadoop clusters and services using Cloudera Manager.
  • Experience using job scheduling tools like Cron, Tivoli and Automic.
  • Experienced in troubleshooting errors in HBase Shell/API, Pig, Hive and MapReduce.
  • Highly experienced in importing and exporting data between HDFS and relational database management systems using Sqoop.
  • Experience in installing, configuring and administering Hadoop clusters for major Hadoop distributions like CDH4 and CDH5.
  • Experience in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries (a brief PySpark sketch of this pattern follows this list).
  • Expertise in using various Hadoop infrastructure components such as MapReduce, Pig, Hive, ZooKeeper, HBase, Sqoop, Oozie, Flume, Drill and Spark for data storage and analysis.
  • Experience in developing data pipelines through the Kafka-Spark API.
  • Proficient in data processing tasks such as collecting, aggregating and moving data from various sources using Apache Flume and Kafka.
  • Experienced in building data warehouses on the Azure platform using Azure Databricks and Data Factory.
  • Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases like Cassandra and MongoDB.
  • Experienced in creating vizboards for data visualization in Platfora for real-time dashboards on Hadoop.
  • Collected log data from various sources and integrated it into HDFS using Flume.
  • Good knowledge of cluster benchmarking and performance tuning.
  • Designed and implemented a product search service using Apache Solr.
  • Good knowledge of querying data from Cassandra for searching, grouping and sorting.
  • Good knowledge of AWS concepts like the EMR and EC2 web services, which provide fast and efficient processing of Big Data.
  • Good experience in generating statistics, extracts and reports from Hadoop.
  • Excellent experience and knowledge of machine learning, mathematical modeling and operations research. Comfortable with R, Python, SAS, Weka, MATLAB and relational databases. Deep understanding of and exposure to the Big Data ecosystem.
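
For illustration, a minimal PySpark sketch of the partitioning and bucketing pattern mentioned above; the table, column and path names are hypothetical placeholders, not values from any project:

    from pyspark.sql import SparkSession

    # Minimal sketch: partition by date and bucket by user id so HiveQL filters
    # and joins can prune work. Names below (web_events, event_date, user_id)
    # are hypothetical.
    spark = (SparkSession.builder
             .appName("hive-partitioning-sketch")
             .enableHiveSupport()
             .getOrCreate())

    events = spark.read.parquet("/data/raw/web_events")  # hypothetical source path

    (events.write
           .mode("overwrite")
           .partitionBy("event_date")
           .bucketBy(16, "user_id")
           .sortBy("user_id")
           .saveAsTable("analytics.web_events"))

    # A typical optimized HiveQL query then touches only one partition:
    spark.sql("""
        SELECT user_id, COUNT(*) AS events
        FROM analytics.web_events
        WHERE event_date = '2020-01-01'
        GROUP BY user_id
    """).show()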

TECHNICAL SKILLS

Programming Languages: Scala, Python, SQL, Java, PL/SQL, Linux shell scripts.

Methodologies: Agile, UML, Design Patterns (Core Java and J2EE)

RDBMS: Oracle 10g/11g/12c, MySQL, SQL Server, Teradata

NoSQL: HBase, Cassandra, MongoDB

Big Data/ Hadoop Ecosystem: HDFS, MapReduce, YARN, Hive, Pig, HBase, Kafka, Impala, ZooKeeper, Sqoop, Oozie, DataStax & Apache Cassandra, Drill, Flume, Spark, Solr and Avro

Web/Application servers: Tomcat, LDAP

Cloud Environment: AWS, Microsoft Azure

BI Tools: Platfora, Tableau, Pentaho

Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans

Tools Used: Eclipse, PuTTY, Cygwin, MS Office

PROFESSIONAL EXPERIENCE

Confidential, Atlanta, GA

Senior Big Data Engineer

Responsibilities:

  • Collected data from various Flume agents deployed on multiple servers using multi-hop flows.
  • Ingested real-time and near-real-time (NRT) streaming data into HDFS using Flume.
  • Handled administration activities using Cloudera Manager.
  • Worked with different join patterns and implemented both map-side and reduce-side joins.
  • Wrote Flume configuration files for importing streaming log data into HBase.
  • Involved in developing Impala scripts for extraction, transformation and loading of data into the data warehouse.
  • Created and maintained optimal data pipeline architecture in the Microsoft Azure cloud using Data Factory and Azure Databricks.
  • Created an architectural solution that leverages the best Azure analytics tools to solve the specific needs of the Chevron use case.
  • Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
  • Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
  • Created partitioned Hive tables and worked on them using HiveQL.
  • Involved in data ingestion into HDFS using Sqoop for full loads and Flume for incremental loads from a variety of sources such as web servers, RDBMS and data APIs.
  • Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data.
  • Involved in transforming data from mainframe tables to HDFS and HBase tables using Sqoop.
  • Brought data into HBase using both the HBase shell and the HBase client API.
  • Worked on UDFs using Python for data cleansing.
  • Built the infrastructure required for optimal extraction, transformation and loading of data from a wide variety of data sources using SQL and big data technologies like Hadoop Hive and Azure Data Lake Storage.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
  • Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
  • Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
  • Created ETL mappings with Talend Integration Suite to pull data from sources, apply transformations and load data into target databases.
  • Loaded data into HBase using both bulk and non-bulk loads.
  • Implemented workflows using the Apache Oozie framework to automate tasks.
  • Involved in migrating tables from RDBMS into Hive tables using Sqoop and later generated visualizations using Tableau.
  • Built pipelines to move hashed and un-hashed data from XML files to the data lake.
  • Developed Spark scripts using Python on Azure HDInsight for data aggregation and validation, and verified their performance against MR jobs.
  • Extensively worked with the Spark SQL context to create data frames and datasets for pre-processing the model data.
  • Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation (a sketch follows this list).
  • Used Kafka functionality such as distribution, partitioning and the replicated commit log service for messaging systems by maintaining feeds, and created applications that monitor consumer lag within Apache Kafka clusters.
  • Developed NiFi workflows to pick up data from the REST API server, the data lake and the SFTP server and send it to the Kafka broker.
  • Implemented Spark Kafka streaming to pick up data from Kafka and send it to the Spark pipeline (see the streaming sketch after this list).
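
For illustration, a minimal PySpark Structured Streaming sketch of the Kafka-to-Spark-to-HDFS flow referenced above; the broker address, topic name and paths are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-spark-sketch").getOrCreate()

    # Subscribe to a (hypothetical) topic fed by the NiFi workflows.
    raw = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092")
                .option("subscribe", "ingest-events")
                .option("startingOffsets", "latest")
                .load())

    # Kafka delivers the payload as binary; cast it to string for downstream parsing.
    events = raw.select(col("value").cast("string").alias("payload"))

    # Persist the stream to HDFS as Parquet, with checkpointing for fault tolerance.
    query = (events.writeStream
                   .format("parquet")
                   .option("path", "hdfs:///data/streaming/events")
                   .option("checkpointLocation", "hdfs:///checkpoints/events")
                   .outputMode("append")
                   .start())

    query.awaitTermination()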
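
And a sketch of the kind of PySpark / Spark SQL business-rule transformation run in Azure Databricks; the table, columns, mount path and rule are illustrative assumptions, not the actual business logic:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("business-rule-sketch").getOrCreate()

    # Hypothetical curated dataset on an assumed data lake mount.
    orders = spark.read.parquet("/mnt/datalake/curated/orders")
    orders.createOrReplaceTempView("orders")

    # Express the business rule in Spark SQL...
    classified = spark.sql("""
        SELECT order_id,
               customer_id,
               amount,
               CASE WHEN amount > 10000 THEN 'HIGH_VALUE' ELSE 'STANDARD' END AS order_tier
        FROM orders
        WHERE order_status = 'COMPLETED'
    """)

    # ...then enrich in PySpark and publish as a managed table.
    result = classified.withColumn("processed_ts", F.current_timestamp())
    result.write.mode("overwrite").saveAsTable("curated.orders_classified")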

Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, Kafka, YARN, Hive, Pig, Sqoop, Oozie, Tableau, Java, Solr, Azure, Data Factory, Databricks, HDInsight, PL/SQL, MySQL, Oracle, TEZ

Confidential, St. Louis, MO

Big Data Engineer

Responsibilities:

  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Loaded data from different sources into a data warehouse to perform data aggregations for business intelligence using Python.
  • Designed and implemented Sqoop for the incremental job to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports.
  • Used Sqoop to channel data between HDFS and RDBMS from different sources.
  • Installed, configured and maintained data pipelines.
  • Administered Tableau Server, including creating a user rights matrix for permissions and roles, monitoring report usage and creating sites for various departments.
  • Validated the test data in DB2 tables on mainframes and on Teradata using SQL queries.
  • Developed Python scripts to collect Redshift CloudWatch metrics and automated loading the data points into a Redshift database (see the sketch after this list).
  • Prepared and uploaded SSRS reports; managed database and SSRS permissions.
  • Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various AML modules including watch list filtering, suspicious activity monitoring, CTR, CDD and EDD.
  • Used SQL Server management tools to check the data in the database against the given requirements.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Using AWS Redshift, extracted, transformed and loaded data from various heterogeneous data sources and destinations.
  • Used Kafka and Kafka brokers, initiated the Spark context and processed live streaming information with RDDs, and used Kafka to load data into HDFS and NoSQL databases.
  • Developed scripts for loading application call logs to S3 and used AWS Glue ETL to load them into Redshift for the data analytics team.
  • Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MultiLoad, BTEQ and FastLoad.
  • Used Oozie to automate data loading into the Hadoop Distributed File System.
  • Developed automated regression scripts in Python for validation of ETL processes across multiple databases such as AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL).
  • Designed and implemented multiple ETL solutions with various data sources through extensive SQL scripting, ETL tools, Python, shell scripting and scheduling tools. Performed data profiling and data wrangling of XML, web feeds and files using Python, Unix and SQL.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling and all cleaning and conforming tasks.
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Developed solutions leveraging ETL tools and identified opportunities for process improvements using Informatica and Python.
  • Performed performance tuning, code promotion and testing of application changes.
  • Conducted exploratory data analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlations between features.
  • Experience in building real-time data pipelines with Kafka Connect and Spark Streaming.
  • Used Spark Streaming to receive real-time data from Kafka and stored the stream data in HDFS using Python and in NoSQL databases such as HBase and Cassandra.
  • Collected data from an AWS S3 bucket using Spark Streaming in near-real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
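
For illustration, a sketch of the CloudWatch-metrics-to-Redshift collection script mentioned above, assuming boto3 for the CloudWatch API and psycopg2 for the Redshift connection; the cluster identifier, table name and connection details are hypothetical:

    import boto3
    import psycopg2
    from datetime import datetime, timedelta

    # Pull one hour of average CPU utilization for a (hypothetical) Redshift cluster.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    end = datetime.utcnow()
    start = end - timedelta(hours=1)

    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Redshift",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "ClusterIdentifier", "Value": "analytics-cluster"}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    rows = [(dp["Timestamp"], "CPUUtilization", dp["Average"]) for dp in resp["Datapoints"]]

    # Land the datapoints in a monitoring table (psycopg2 speaks the Postgres
    # wire protocol that Redshift accepts); credentials are placeholders.
    conn = psycopg2.connect(host="analytics-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="metrics", user="loader", password="****")
    with conn, conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO cloudwatch_metrics (metric_ts, metric_name, metric_value) "
            "VALUES (%s, %s, %s)",
            rows,
        )
    conn.close()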

Environment: Cloudera Manager (CDH5), AWS, S3, EC2, Redshift, CloudWatch, Hadoop, PySpark, HDFS, NiFi, Pig, Hive, Kafka, SSIS, Snowflake, PyCharm, Scrum, Git, Sqoop, HBase, Informatica, SQL, Python, XML, Oracle, MS SQL, T-SQL, MongoDB, DB2, Tableau, Unix, Shell Scripting.

Confidential - Washington, D.C.

Big Data Engineer

Responsibilities:

  • Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.
  • Developed Airflow DAGs in Python by importing the Airflow libraries (see the DAG sketch after this list).
  • Created session beans and controller servlets for handling HTTP requests from Talend.
  • Developed a PySpark program that writes data frames to HDFS as Avro files.
  • Used Hive to implement a data warehouse and stored data in HDFS; stored data in Hadoop clusters set up in AWS EMR.
  • Worked with various HDFS file formats like Avro, SequenceFile and JSON, and various compression formats like Snappy and bzip2.
  • Hands-on experience using AWS services like EC2, S3, Auto Scaling and DynamoDB, along with MongoDB, NiFi and Talend.
  • Used Elasticsearch for indexing/full-text searching.
  • Designed column families in Cassandra, ingested data from RDBMS, performed transformations and exported the data to Cassandra.
  • Wrote SQL queries against Snowflake.
  • Worked on analyzing the Hadoop cluster and different big data analytics tools including Pig and Hive.
  • Working experience with data streaming processes using Kafka, Apache Spark and Hive.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on the data received from Kafka.
  • Experience in moving raw data between different systems using Apache NiFi.
  • Involved in loading data from the UNIX file system to HDFS using shell scripting.
  • Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
  • Created PySpark code that uses Spark SQL to generate data frames from the Avro-formatted raw layer and writes them to data-service-layer internal tables in ORC format (a sketch follows this list).
  • In charge of PySpark code that creates data frames from tables in the data service layer and writes them to a Hive data warehouse.
  • Performed data preparation using Pig Latin to get the data into the required format.
  • Used Python pandas, NiFi, Jenkins, NLTK and TextBlob to complete the ETL process for clinical data for future NLP analysis.
  • Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL and CSV file datasets into data frames using PySpark.
  • Used Flume to collect, aggregate and store web log data from different sources such as web servers, mobile and network devices, and pushed it into HDFS.
  • Expertise in writing Hadoop jobs for analyzing data using HiveQL queries, Pig Latin (a data flow language) and custom MapReduce programs in Java.
  • Designed and implemented effective analytics solutions and models with Snowflake.
  • Processed image data through the Hadoop distributed system using MapReduce and then stored it in HDFS.
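
For illustration, a minimal Airflow DAG sketch of the kind referenced above, assuming Airflow 2.x import paths; the DAG id, schedule and commands are hypothetical placeholders:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator  # Airflow 2.x import path

    # A daily pipeline that lands raw files, then runs a PySpark transform job.
    with DAG(
        dag_id="daily_ingest_sketch",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        land_files = BashOperator(
            task_id="land_files",
            bash_command="hdfs dfs -put /staging/*.avro /data/raw/",
        )

        transform = BashOperator(
            task_id="pyspark_transform",
            bash_command="spark-submit /opt/jobs/raw_to_dsl.py",
        )

        land_files >> transform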
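
And a sketch of the raw-layer-to-data-service-layer step (Avro in, ORC out); database, table and column names are hypothetical, and reading Avro assumes the spark-avro package is available:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("raw-to-dsl-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Read the (hypothetical) Avro-formatted raw layer from HDFS.
    raw = spark.read.format("avro").load("hdfs:///data/raw/claims/")

    # Light conforming before promotion to the data service layer.
    conformed = (raw.dropDuplicates(["claim_id"])
                    .withColumnRenamed("amt", "claim_amount"))

    # Write as ORC into an internal (managed) Hive table in the DSL database.
    (conformed.write
              .format("orc")
              .mode("overwrite")
              .saveAsTable("dsl.claims"))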

Environment: Hadoop, Hive, AWS, PySpark, Cloudera, MapReduce, Apache, Kafka, Java, Python, Pandas, Pig, Cassandra, Jenkins, Flume, Snowflake, SQL Server, MySQL, PostgreSQL, MongoDB, DynamoDB, Airflow, Unix, Shell Scripting.

Confidential

Data Engineer

Responsibilities:

  • Installed, configured and maintained Apache Hadoop clusters for application development, along with Hadoop tools like Hive, Pig, ZooKeeper and Sqoop.
  • Implemented partitioning, dynamic partitions and buckets in Hive (see the sketch after this list).
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
  • Installed and configured Sqoop to import and export data between Hive and relational databases.
  • Developed a data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
  • Used Python and SAS to extract, transform and load source data from transaction systems and generated reports, insights and key conclusions.
  • Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to understand the data on the fly using quick filters for on-demand information.
  • Administered large Hadoop environments, including cluster setup, support, performance tuning and monitoring in an enterprise environment.
  • Closely monitored and analyzed MapReduce job executions on the cluster at the task level and optimized Hadoop cluster components to achieve high performance.
  • Designed and developed data mapping procedures and ETL (data extraction, analysis and loading) processes for integrating data using R programming.
  • Monitored Hadoop cluster operations through MCS and worked on NoSQL databases including HBase.
  • Used Hive, created Hive tables, handled data loading, wrote Hive UDFs, and worked with the Linux server admin team to administer the server hardware and operating system.
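
For illustration, a sketch of loading a dynamically partitioned Hive table through Spark SQL, covering the dynamic-partition piece of the bullet above; the table and column names are hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dynamic-partition-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Allow fully dynamic partition inserts.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS behavior.events_by_day (
            user_id STRING,
            action  STRING,
            amount  DOUBLE
        )
        PARTITIONED BY (event_date STRING)
        STORED AS ORC
    """)

    # The partition value is taken from the last column of the SELECT.
    spark.sql("""
        INSERT OVERWRITE TABLE behavior.events_by_day PARTITION (event_date)
        SELECT user_id, action, amount, event_date
        FROM behavior.events_staging
    """)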

Environment: Hadoop YARN, Spark, Spark Streaming, Spark SQL, Scala, Kafka, Python, Hive, Sqoop, Impala, Tableau, Talend, Oozie, Java, AWS S3, Oracle 11g, Linux

Confidential

Data Analyst

Responsibilities:

  • Involved in analysis, design and documentation of business requirements and data specifications. Supported data warehousing extraction programs, end-user reports and queries.
  • Created monthly and quarterly business monitoring reports by writing Teradata SQL queries involving system calendars, inner joins and outer joins to retrieve data from multiple tables.
  • Developed BTEQ scripts in Unix using PuTTY and used crontab to automate the batch scripts and execute scheduled jobs in Unix.
  • Worked on numerous ad-hoc data pulls for business analysis and monitoring by writing SQL scripts.
  • Analyzed and validated data in the Hadoop data lake by querying Hive tables.
  • Created reports and charts by querying data using Hive Query Language and reported gaps in the data loaded into the lake.
  • Good knowledge of JSON-format data; performed source and target validations using aggregations and null-validity functions.
  • Created multi-set tables and volatile tables from existing tables and collected statistics on the tables to improve performance.
  • Developed Teradata SQL scripts using RANK functions to improve query performance while pulling data from large tables (a sketch follows this list).
  • Experience in performing dual data validation on various business-critical reports, working with another analyst.
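
For illustration, a sketch of the RANK-based pull pattern mentioned above, assuming the teradatasql Python driver is available; the connection details, table and column names are hypothetical:

    import teradatasql  # Teradata SQL Driver for Python (assumed available)

    # Restrict the pull to each customer's ten most recent transactions so the
    # query moves far fewer rows off the large table.
    query = """
        SELECT customer_id,
               txn_date,
               txn_amount
        FROM retail.transactions
        QUALIFY RANK() OVER (PARTITION BY customer_id ORDER BY txn_date DESC) <= 10
    """

    con = teradatasql.connect(host="tdprod.example.com", user="analyst", password="****")
    cur = con.cursor()
    cur.execute(query)
    for row in cur.fetchall():
        print(row)
    cur.close()
    con.close()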

Environment: Teradata, Hive, Hadoop, Unix, Oracle, SQL, ad-hoc Queries, MS Office, Windows.
